To perform online iNMF, we need to install the online branch. Please see the instruction below.
library(devtools)
install_github("MacoskoLab/liger", ref = "online")
We first create a liger object by passing the filenames of HDF5 files containing the raw count data. The data can be downloaded here. Liger assumes by default that the HDF5 files are formatted by the 10X CellRanger pipeline. Large datasets are often generated over multiple 10X runs (for example, multiple biological replicates). In such cases it may be necessary to merge the HDF5 files from each run into a single HDF5 file. We provide the mergeH5 function for this purpose (see below for details).
library(liger)
pbmcs = createLiger(list(stim = "pbmcs_stim.h5", ctrl = "pbmcs_ctrl.h5"))
We then perform the normalization, gene selection, and gene scaling in an online fashion, reading the data from disk in small batches.
pbmcs = normalize(pbmcs)
pbmcs = selectGenes(pbmcs, var.thresh = 0.2, do.plot = F)
pbmcs = scaleNotCenter(pbmcs)
Now we can use online iNMF to factorize the data, again using only minibatches that we read from the HDF5 files on demand (default mini-batch size = 5000).
pbmcs = online_iNMF(pbmcs, k = 20, max.epochs = 5)
After performing the factorization, we can perform quantile normalization to align the datasets.
pbmcs = quantile_norm(pbmcs)
We can also visualize the cell factor loadings in two dimensions using t-SNE or UMAP.
pbmcs = runUMAP(pbmcs)
plotByDatasetAndCluster(pbmcs, axis.labels = c("UMAP1","UMAP2"))
Let’s first evaluate the quality of data alignment. The alignment score ranges from 0 (no alignment) to 1 (perfect alignment).
calcAlignment(pbmcs)
## [1] 0.9151633
With HDF5 files as input, we need to sample the raw, normalized, or scaled data from the full dataset on disk and load them in memory. Some plotting functions and downstream analyses are designed to operate on a subset of cells sampled from the full dataset. This enables rapid analysis using limited memory. The readSubset function allows either uniform random sampling or sampling balanced by cluster. Here we extract the normalized count data of 5000 sampled cells.
pbmcs = readSubset(pbmcs, slot.use = "norm.data", max.cells = 5000)
Using the sampled data stored in memory, we can now compare clusters or datasets (within each cluster) to identify differentially expressed genes. The runWilcoxon function performs differential expression analysis by performing an in-memory Wilcoxon rank-sum test on this subset. Thus, users can still analyze large datasets with a fixed amount of memory.
de_genes = runWilcoxon(pbmcs, compare.method = "datasets")
Here we show the top 10 genes in cluster 1 whose expression level significantly differ between two dataset.
de_genes = de_genes[order(de_genes$padj), ]
head(de_genes[de_genes$group == "1-stim",], n = 10)
## feature group avgExpr logFC statistic auc pval
## 14061 ISG15 1-stim -4.880218 13.6623938 31059.0 0.9963110 1.278947e-58
## 21242 IFIT3 1-stim -6.456507 15.8902646 30474.5 0.9775614 1.142604e-56
## 23989 ISG20 1-stim -4.853985 7.7486189 30790.0 0.9876820 1.507809e-55
## 14324 IFI6 1-stim -6.745723 15.3953482 30255.5 0.9705363 3.215197e-55
## 21244 IFIT1 1-stim -8.478212 13.9878883 28923.0 0.9277924 1.793082e-47
## 27532 MX1 1-stim -8.107045 13.0069732 28587.0 0.9170142 5.257305e-44
## 20364 LY6E 1-stim -8.434516 12.5842789 28489.0 0.9138705 1.437468e-43
## 17003 TNFSF10 1-stim -10.070412 12.7316667 27673.5 0.8877109 3.125990e-41
## 21241 IFIT2 1-stim -10.430999 12.2589059 27222.0 0.8732277 9.867292e-39
## 23754 B2M 1-stim -3.082797 0.4228452 27864.5 0.8938378 9.854570e-37
## padj pct_in pct_out
## 14061 1.797304e-54 100 100
## 21242 8.028504e-53 100 100
## 23989 7.063078e-52 100 100
## 14324 1.129579e-51 100 100
## 21244 5.039636e-44 100 100
## 27532 1.231349e-40 100 100
## 20364 2.885819e-40 100 100
## 17003 5.491193e-38 100 100
## 21241 1.540723e-35 100 100
## 23754 1.384863e-33 100 100
We can show UMAP coordinates of sampled cells by their loadings on each factor (Factor 1 as an example). Underneath it displays the most highly loading shared and dataset-specific genes, with the size of the marker indicating the magnitude of the loading.
p_wordClouds = plotWordClouds(pbmcs, num.genes = 5, return.plots = T)
p_wordClouds[[1]]
We can generate plots of dimensional reduction coordinates colored by expression of specified gene. The first two UMAP dimensions and gene ISG15 (identified by Wilcoxon test in the previous step) is used as an example here.
plotGene(pbmcs, "ISG15", return.plots = F)
Furthermore, we can make violin plots of expression of specified gene for each dataset (IFI6 as an example).
plotGeneViolin(pbmcs, "ISG15", return.plots = F)
The online algorithm can be implemented on datasets loaded in memory as well. The same analysis is performed on the PBMCs, shown below.
stim = readRDS("pbmcs_stim.RDS")
ctrl = readRDS("pbmcs_ctrl.RDS")
pbmcs_mem = createLiger(list(stim = stim, ctrl = ctrl), remove.missing = F)
pbmcs_mem = normalize(pbmcs_mem)
pbmcs_mem = selectGenes(pbmcs_mem, var.thresh = 0.2, do.plot = F)
pbmcs_mem = scaleNotCenter(pbmcs_mem)
pbmcs_mem = online_iNMF(pbmcs_mem, k = 20, max.epochs = 5)
pbmcs_mem = quantile_norm(pbmcs_mem)
pbmcs_mem = runUMAP(pbmcs_mem)
plotByDatasetAndCluster(pbmcs_mem, axis.labels = c("UMAP1","UMAP2"))
As mentioned above, it is sometimes necessary to merge multiple HDF5 files (such as multiple 10X runs from the same tissue or condition) into a single file. We provide the mergeH5 function for this purpose. The function takes as input a list of filenames to merge (file.list), a vector of sample names that are prepended to the cell barcodes (library.names), and the name of the merged HDF5 file. The function requires that all files to be merged include exactly the same set of genes. For example, we can merge the cells and nuclei datasets used in the examples below (note that merging these particular two datasets is not something you would like to do, and is purely to demonstrate the mergeH5 function).
mergeH5(file.list = list("allen_smarter_cells.h5", "allen_smarter_nuclei.h5"),
library.names = c("cells","nuclei"),
new.filename = "cells_nuclei")
We can also perform online iNMF with continually arriving datasets.
MOp = createLiger(list(cells = "allen_smarter_cells.h5"))
MOp = normalize(MOp)
MOp = selectGenes(MOp, var.thresh = 2)
MOp.vargenes = MOp@var.genes
MOp = scaleNotCenter(MOp)
MOp = online_iNMF(MOp, k = 40, max.epochs = 1)
MOp = quantile_norm(MOp)
MOp = runUMAP(MOp)
plotByDatasetAndCluster(MOp, axis.labels = c("UMAP1","UMAP2"))
MOp2 = createLiger(list(nuclei = "allen_smarter_nuclei.h5"))
MOp2 = normalize(MOp2)
MOp2@var.genes = MOp@var.genes
MOp2 = scaleNotCenter(MOp2)
MOp = online_iNMF(MOp, X_new = list(nuclei = MOp2), k = 40, max.epochs = 1)
MOp = quantile_norm(MOp)
MOp = runUMAP(MOp)
plotByDatasetAndCluster(MOp, axis.labels = c("UMAP1","UMAP2"))
MOp = createLiger(list(cells = "allen_smarter_cells.h5"))
MOp = normalize(MOp)
MOp@var.genes = MOp.vargenes
MOp = scaleNotCenter(MOp)
MOp = online_iNMF(MOp, k = 40, max.epochs = 1)
MOp = quantile_norm(MOp)
MOp = runUMAP(MOp)
plotByDatasetAndCluster(MOp, axis.labels = c("UMAP1","UMAP2"))
MOp = online_iNMF(MOp, X_new = list(nuclei = MOp2), k = 40, project = TRUE)
MOp = quantile_norm(MOp)
MOp = runUMAP(MOp)
plotByDatasetAndCluster(MOp, axis.labels = c("UMAP1","UMAP2"))